ABSTRACT

Today's Web of Data is noisy. Linked Data often needs
extensive preprocessing to enable efficient use of
heterogeneous resources. While consistent and valid data is
the key to efficient data processing and aggregation, we
face two main challenges: (1) identifying erroneous facts
and tracking their origins in dynamically connected datasets
is a difficult task, and (2) efforts in the curation of
deficient facts in Linked Data are rarely exchanged. Since
erroneous data is often duplicated and (re-)distributed by
mashup applications, keeping data tidy is not only the
responsibility of a few original publishers but becomes a
mission for all distributors and consumers of Linked Data,
too. We present a new approach to expose and reuse patches
on erroneous data in order to enhance the Web of Data and
add quality information to it. The feasibility of our
approach is demonstrated by the example of a collaborative
game that patches statements in DBpedia data and provides
notifications for relevant changes.

Categories and Subject Descriptors 

H.3.5 [Information Storage and Retrieval]: On-line
Information Services - Data sharing; C.2.4 [Distributed
Systems]: Distributed databases; I.2.4 [Computing
Methodologies]: Knowledge Representation Formalisms and
Methods
1. INTRODUCTION

With the continuous growth of Linked Data on the World Wide
Web (WWW) and the increase of web applications that consume
Linked Data, the quality of Linked Data resources has become
a relevant issue. Recent initiatives (e.g., the Pedantic Web
group, http://pedantic-web.org/) uncovered various defects
and flaws in Linked Data resources. Apart from structural
defects, semantic flaws and factual mistakes are hard to
detect by automatic procedures and require updates on the
schema level as well as on the data level. Today, semantic
web applications often rely on a local copy of originally
distributed Linked Data for reasons of system stability,
access time, and data integration. When the original
publisher updates the genuine dataset, changes made to a
local copy may not be considered and get lost.

The distribution, duplication, and reuse of erroneous data
in various semantic web applications is indeed a problem,
but it also opens up opportunities such as crowdsourcing to
improve data quality. For example, a semantic web
application might let users give feedback to flag facts that
need to be revised. Detected errors could then be shared
with the original data publisher or with other users of the
dataset, and both would be able to fix the identified
defects. While the need for error correction and data
cleansing has attracted the interest of the Linked Data
community, there is no generally accepted method to expose,
advertise, and retrieve suitable updates for Linked Data
resources. In order to reuse curation efforts and to realize
the vision of a collaborative method for error detection and
effective exchange of corresponding patches, the following
requirements have to be considered:

1. The descriptions of defects and their corresponding
patches for Linked Data resources should be selectable by
various criteria, e.g., the scope of a patch, its advocates,
its provenance, and the type of defect, so that patches can
be selected efficiently.

2. An appropriate workflow, including guidelines for
publishing detected errors, has to notify the original
publishers as well as other users of a particular dataset.
To encourage acceptance, the execution of updates has to be
as convenient as possible.

3. Patches for Linked Data resources should also be pub- 
lished as Linked Data to ease their exchange and to 
make them available for rating, discussions, and reuse. 



The remainder of this paper is organized as follows. An
overview of related work in the areas of Linked Data
curation, crowdsourcing approaches, and related ontologies
is provided in Section 2. Section 3 discusses a new approach
that allows updates for particular Linked Data resources to
be exposed, rated, and selected by means of a recommended
ontology, which is described in Section 4. The feasibility
of this approach is illustrated in Section 5, where flaws
detected with the help of a collaborative data cleansing
game (WhoKnows? [1]) are exposed and shared using the
ontology described herein. Section 6 concludes the paper
with some general usage guidelines to encourage others to
share curation efforts with the Linked Data community.

2. RELATED WORK 

In order to raise quality in Linked Data applications,
various work has concentrated on syntactical and logical
data consistency by providing validators for the Semantic
Web languages RDF and OWL, e.g., W3C's RDF Validator,
Vapour, and OWL Validator. Data quality in Linked Data has
been criticized; e.g., Hogan et al. analyzed typical errors
and set up the RDF:Alerts service, which detects syntax and
datatype errors to support publishers of Linked Data [2].
Nevertheless, recent analyses provided by LODStats show that
a vast amount of erroneous Linked Datasets is still
available [3]. However, validators are restricted to
syntactical consistency and thus are not able to fulfill the
following requirements:



1. Recognize the inconsistent usage of properties that have
restricted domains and ranges with entities belonging to
another class. For example, the RDF triple

dbp:Apple_Inc. dbo:keyPerson dbp:Chairman .

is syntactically correct, but the range restriction of
dbo:keyPerson implies that the entity dbp:Chairman is of
type dbo:Person. dbp:Chairman is untyped, but it denotes a
business role; from a user's point of view the triple is
therefore likely incorrect, because an actual person entity
is expected to be a key person of a company.

2. Recognize false facts that do not correspond to reality;
e.g., the birthdate of a person can be syntactically correct
but factually wrong.
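
The first requirement can be pictured as a range check over typed entities. The following Python sketch is only an illustration under stated assumptions, not the method of any cited validator: triples are plain tuples, and `RANGES` and `TYPES` are hypothetical in-memory stand-ins for schema and instance data. The entity `dbp:Steve_Jobs` is added purely for contrast and does not appear in the example above.

```python
# Hypothetical in-memory stand-ins for schema and instance data.
RANGES = {"dbo:keyPerson": "dbo:Person"}    # property -> expected class of the object
TYPES = {"dbp:Steve_Jobs": {"dbo:Person"}}  # entity -> asserted classes (dbp:Chairman is untyped)


def range_violations(triples, ranges, types):
    """Return triples whose object is not typed with the property's range class."""
    violations = []
    for subject, prop, obj in triples:
        expected = ranges.get(prop)
        if expected is None:
            continue  # no range restriction known for this property
        if expected not in types.get(obj, set()):
            violations.append((subject, prop, obj))
    return violations


triples = [
    ("dbp:Apple_Inc.", "dbo:keyPerson", "dbp:Chairman"),
    ("dbp:Apple_Inc.", "dbo:keyPerson", "dbp:Steve_Jobs"),  # illustrative contrast case
]
flagged = range_violations(triples, RANGES, TYPES)
print(flagged)
```

Such a check only yields candidates. Whether a flagged triple is actually wrong remains a judgment that, as argued here, benefits from human feedback.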



Although validators are useful to verify syntactical
consistency and correctness, they cannot detect semantic or
factual mistakes that may seem evident to a human.
Therefore, an effective integration of human intelligence,
i.e., crowdsourcing, is required. We address this issue by
enabling an interoperable exchange of user feedback on
Linked Data facts. So far, this concept is rarely found in
the Linked Data community.

A number of games with a purpose (GWAP) [2] are available
that harness human intelligence for various complex tasks by
providing appropriate incentives, e.g., fun, competition, or
reputation [5]. Most GWAPs are social web applications
designed for the generation of metadata, such as for
multimedia content, but do not necessarily publish Linked
Data. Harnessing human intelligence for creating semantic
content has been studied by Siorpaes and Simperl, who
provide a collection of games to build ontologies, annotate
video clips, or match ontologies [6, 5]. However, these
games generally concentrate on content enrichment rather
than on content curation.

WhoKnows? is a simple quiz game in the style of 'Who Wants
to Be a Millionaire?', published previously by our research
group and designed to detect errors and shortcomings in
DBpedia resources [1]. Likewise, RISQ is a game in the style
of 'Jeopardy!' that focuses on the evaluation and ranking of
Linked Data properties about famous people. Both games are
already well accepted but lack a standardized method to
publish the obtained curation efforts.

We propose a Linked Data approach to describe changes made
to Linked Data resources in order to make these updates
(additions or deletions) usable by the community as well.
Therefore, we also have to consider work on data provenance
information and version control for Linked Data resources.
With respect to provenance, the Provenance Vocabulary [7]
(prv) is designed to make the origin, creation, and
alteration of data transparent, but lacks concepts that
describe the update of particular RDF triples. For the
description of graph updates several ontologies have been
defined, e.g., changeset (cs), the graph update ontology
(guo), Delta, and the triplify update vocabulary. All of
these are limited to expressing changes within semantic data
or to describing the provenance of changes. Neither kind of
vocabulary is designed to promote, discuss, and exchange
curation efforts, since means to express relevance as well
as different types of updates do not exist. To the best of
our knowledge, none of these ontologies has been combined or
extended to exchange curation information about Linked Data
resources.

In this paper a new approach is proposed to curate Linked
Data collaboratively, targeting, e.g., flaws in DBpedia
resources that are hard to detect by automatic procedures.
With respect to data cleansing of DBpedia resources one
could argue that curation efforts should be applied directly
to the original source, i.e., to the online encyclopedia
Wikipedia. The approach proposed herein goes beyond such
efforts, e.g., DBpedia Live [8], and can be applied to any
Linked Data resource. Furthermore, the proposed method
supports the common practice of replicating original data
resources in local repositories, and therefore enables a
more convenient handling of data with regard to performance,
stability, and integration issues.

3. WORKFLOW DESCRIPTION 

By design, data providers and consumers on the Web of Data
are not always the same party. Linked Data promotes the use
of external data resources within one's own applications. As
a result, a dataset is usually not under the control of the
agent who uses it, so the agent cannot fix identified
inconsistencies himself. An alternative would be to ask the
dataset provider for a fix. Nowadays, however, it is still
common practice to set up one's own data store containing
web data as a local copy, be it for reasons of performance
or data control.

Either way, there is currently no standardized method to
inform other parties that use or distribute a particular
dataset about inconsistencies that have been detected within
the data. We therefore suggest the PatchR vocabulary (cf.
Section 4), which allows patch requests to be described
together with provenance information.

As shown in the workflow diagram in Fig. 1, multiple agents
independently make use of a particular dataset, accessing it
directly or as a local copy. Whenever an agent, which may be
a human or an algorithm, identifies inconsistent facts (RDF
triples) in the dataset, it can create a patch request. The
patch request describes the update that has to be performed
on the dataset to solve the identified issue. As illustrated
by the example in Listing 1, an update consists of a set of
triples to add and/or a set of triples to delete in the
dataset.

We suggest publishing patch requests in a patch repository.
Although everybody can easily set up their own repository,
this would imply unnecessary administration effort. We
propose at least one common centralized repository for each
dataset, where multiple agents can report their findings to
gain maximum feedback. Since patch requests are encoded as
Linked Data themselves, the repository represents an RDF
graph, so that data about patch requests can be retrieved
using the SPARQL query language.

Dataset providers, including providers of local copies, can
use the repository to retrieve individual updates for their
hosted datasets. Dataset consumers can look up patches to
draw conclusions about the quality of a particular dataset.
With simple modifications the repository can be extended to
allow commenting and user voting on patches.

To extract an update for a particular dataset, the
repository can be queried for patches that satisfy quality
constraints matching the required level of trust, e.g.,
having a minimum number of supporters. The resulting patches
can directly be transformed into SPARQL update queries to
enable a convenient update of the dataset.
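
The selection step can be sketched in a few lines of Python. This is a minimal illustration that assumes the patches have already been fetched from the repository into memory; the `Patch` class, the `select_patches` function, and all identifiers except `repo:Patch_15` and `repo:Player_25` are hypothetical and not part of PatchR.

```python
from dataclasses import dataclass, field


@dataclass
class Patch:
    """Hypothetical in-memory view of one pro:Patch fetched from a repository."""
    uri: str
    dataset: str                                  # value of pro:appliesTo
    status: str                                   # value of pro:status
    advocates: set = field(default_factory=set)   # values of pro:hasAdvocate


def select_patches(patches, dataset, min_advocates=2):
    """Return active patches for `dataset` backed by at least `min_advocates` agents."""
    return [
        p for p in patches
        if p.dataset == dataset
        and p.status == "active"
        and len(p.advocates) >= min_advocates
    ]


DBPEDIA = "http://dbpedia.org/void.ttl#DBpedia"
patches = [
    Patch("repo:Patch_15", DBPEDIA, "active", {"repo:Player_25", "repo:Player_31"}),
    Patch("repo:Patch_16", DBPEDIA, "active", {"repo:Player_25"}),
    Patch("repo:Patch_17", DBPEDIA, "resolved", {"repo:Player_2", "repo:Player_3"}),
]
trusted = select_patches(patches, DBPEDIA, min_advocates=2)
print([p.uri for p in trusted])
```

In a deployed setting the same constraint would more naturally be expressed in the SPARQL query sent to the repository; the threshold of two advocates is an arbitrary example value.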

4. DESCRIPTION OF THE PATCH REQUEST 
ONTOLOGY 

The Patch Request Ontology (pro), subsequently referred to
as PatchR, provides a straightforward method to share user
feedback about incorrect or missing RDF triples. By wrapping
the guo:UpdateInstruction concept adopted from the Graph
Update Ontology in a pro:Patch, a foaf:Agent can publish
requests to add, modify, or delete particular facts in a
dataset. Each patch is described by provenance information
and the dataset to which it applies. Furthermore, the
ontology supports a simple classification of patches using
the pro:patchType property to allow convenient retrieval of
common patches. These patch types may refer to commonly
observed errors, e.g., encoding problems or factual errors.

Groups of patches can be created to bundle individual
patches, e.g., patches of a particular service that apply to
common problems or that are relevant only for specialized
domains or regions. Fig. 2 provides an overview of the main
concepts of the PatchR ontology, which are described in
Table 1.

Figure 2: Overview on the patch request ontology



5. A USE CASE FOR PATCHR 

As a use case for the PatchR ontology we extended WhoKnows?
[1], an online quiz game in the style of 'Who Wants to Be a
Millionaire?' that generates questions from DBpedia facts
(footnote 16). The game can be played either via Facebook
(footnote 17) or standalone (footnote 18). The game
principle is to present multiple choice questions to the
user that have been generated from DBpedia RDF triples. The
player scores points by giving correct answers within a
limited amount of time and loses lives when giving a wrong
answer.

As an example, Fig. 3 shows the question 'English language
is the language of ...?' with the expected answer 'Ohio'.
The question originates from the triple

dbp:Ohio dbo:language dbp:English_language .

and is composed by reversing the order of the triple:
'Object is the property of: subject1, subject2, subject3...'.
The remaining choices are selected from subjects that use
the same property at least once but are not linked to the
given object. When the player submits her answer, the result
panel shows all choices once again, with the expected answer
highlighted. For each entity used in the question, a link to
the respective DBpedia page is provided so that the user can
examine the resource.
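
The question scheme described above can be sketched as follows. This is an illustrative Python approximation, not the game's actual implementation: the function `build_question`, its parameters, and the two distractor triples with their placeholder objects are assumptions made for the example.

```python
import random


def build_question(triples, prop, obj, n_choices=3, rng=None):
    """Build a question 'obj is the prop of ...?' with one expected answer.

    Distractors are subjects that use `prop` at least once
    but are not linked to `obj` through it.
    """
    rng = rng or random.Random()
    subjects_with_prop = {s for s, p, _ in triples if p == prop}
    correct = {s for s, p, o in triples if p == prop and o == obj}
    if not correct:
        raise ValueError("no triple matches the given property and object")
    answer = rng.choice(sorted(correct))
    pool = sorted(subjects_with_prop - correct)
    distractors = rng.sample(pool, min(n_choices - 1, len(pool)))
    choices = distractors + [answer]
    rng.shuffle(choices)
    return {"text": f"{obj} is the {prop} of ...?", "answer": answer, "choices": choices}


triples = [
    ("dbp:Ohio", "dbo:language", "dbp:English_language"),
    # Placeholder objects, used only to give the distractors a language value.
    ("dbp:Oregon", "dbo:language", "ex:Placeholder_language_1"),
    ("dbp:Dances_with_Wolves", "dbo:language", "ex:Placeholder_language_2"),
]
question = build_question(triples, "dbo:language", "dbp:English_language",
                          rng=random.Random(0))
print(question["text"], question["choices"])
```

Because distractors are drawn only from what the dataset states, a missing triple can turn a factually correct choice into a supposedly wrong one, which is exactly the kind of inconsistency players can report.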

Previously, players could already report odd or obviously
wrong questions by selecting a general 'Dislike' button. We
observed that 'disliked' questions often arose from
inconsistencies, i.e., wrong or missing triples.

16 The currently deployed instance of WhoKnows? is based on
DBpedia version 3.5.1.

17 http://apps.facebook.com/whoknows_/
18 http://tinyurl.com/whoknowsgame
Figure 1: Workflow diagram for creating and applying a patch



Table 1: Description of properties in the patch request ontology

Property        Description
--------------  ------------------------------------------------
hasUpdate       Refers to a guo:UpdateInstruction. There can be
                only one UpdateInstruction per patch.
hasProvenance   Refers to the provenance context, where this
                patch was created.
memberOf        Assignment of a patch to a patch group.
appliesTo       Refers to a void:Dataset, to allow convenient
                selection of patches.
hasAdvocate     A link to people who support a submitted patch.
hasCriticiser   A link to individual entities that disagree with
                the purpose of this patch.
patchType       This property refers to a classification of a
                patch. A patch can have multiple types but
                should have at least one.
comment         An informational comment on the patch.
status          The status of a patch, e.g., 'active' or
                'resolved'.



Therefore, this feature has been extended so that players
can specify the particular fact they consider incorrect by
selecting it from a given set of potential inconsistencies.
These potential inconsistencies are presented to the user as
natural language sentences. Fig. 4 shows a screenshot of
this refinement panel. Sentences of the form 'Object is not
a property of subject.' indicate a wrong fact in the
dataset, while sentences of the form 'Object is also a
property of subject.' indicate a missing fact. From this
user vote the system generates a patch request for either

• deleting one or several triples or

• inserting one or several triples in the underlying
knowledge base.
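
The mapping from a user vote to an update can be sketched as below. The vote labels `is_not` and `is_also`, the function `vote_to_update`, and the dictionary layout are hypothetical names chosen for this illustration; only the distinction between deleting and inserting triples is taken from the text.

```python
def vote_to_update(kind, subject, prop, obj, graph="http://dbpedia.org/"):
    """Translate a refinement-panel vote into an update description.

    kind == "is_not"  -> the stated fact is wrong, so the triple is deleted
    kind == "is_also" -> a fact is missing, so the triple is inserted
    """
    if kind == "is_not":
        operation = "delete"
    elif kind == "is_also":
        operation = "insert"
    else:
        raise ValueError(f"unknown vote kind: {kind!r}")
    return {
        "target_graph": graph,
        "target_subject": subject,
        operation: [(prop, obj)],
    }


missing = vote_to_update("is_also", "dbp:Oregon", "dbo:language", "dbp:English_language")
wrong = vote_to_update("is_not", "dbp:Ohio", "dbo:language", "dbp:English_language")
print(missing)
```

The keys mirror the guo properties used for patch requests (target graph, target subject, insert, delete), so such a description could be serialized into a pro:Patch in a later step.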



Listing 1: An example patch request

repo:Patch_15 a pro:Patch ;
    pro:hasUpdate [
        a guo:UpdateInstruction ;
        guo:target_graph <http://dbpedia.org/> ;
        guo:target_subject dbp:Oregon ;
        guo:insert [
            dbo:language dbp:English_language ]
    ] ;
    pro:hasAdvocate repo:Player_25 ;
    pro:appliesTo
        <http://dbpedia.org/void.ttl#DBpedia> ;
    pro:status "active" ;
    pro:hasProvenance [
        a prv:DataCreation ;
        prv:performedBy repo:WhoKnows ;
        prv:involvedActor repo:Player_25 ;
        prv:performedAt "..."^^xsd:dateTime ] .



Listing 1 shows an example description of a patch request
created by the game, demanding the insertion of the triple

dbp:Oregon dbo:language dbp:English_language .

The created patch requests are immediately posted to a patch
repository implemented as a standard triplestore. The patch
requests collected in the repository are publicly available
and can be accessed via a SPARQL endpoint
(http://purl.org/hpi/patchr-sparql).



To illustrate the collection of patch requests submitted to
the repository, we implemented a user interface
(http://purl.org/hpi/patchrui) that creates reports about
the most recent and most popular patch requests.
Furthermore, inconsistencies for single entities of the
DBpedia dataset can be displayed.

Patching DBpedia can be regarded as a special case, since
this data is based on editable Wikipedia pages. Identified
problems should be fixed sustainably in Wikipedia itself, so
that they do not occur in future DBpedia releases.
Therefore, we also provide a link to the particular
Wikipedia page, where users can resolve the inconsistency by
editing the relevant section directly.

Figure 3: Screenshot and triples used to generate a
One-To-One question

Data providers interested in patches for the DBpedia dataset 
can query the patch collections for those patches they think 
are evident enough to apply to their local graph. To ap- 
ply those patches directly onto a local triplestore, a set of 
SPARQL update queries can be generated. Listing 2 shows 
the update query corresponding to the preceding example. 



Listing 2: The appropriate SPARQL update query
for Listing 1

INSERT DATA INTO <http://dbpedia.org/> {
    dbp:Oregon
        dbo:language dbp:English_language .
}
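
Generating such queries from patch data is mechanical, as the following Python sketch suggests. It is an illustration under assumptions: the input dictionary layout and the function `to_sparql_update` are hypothetical, and the output uses the SPARQL 1.1 Update form with a GRAPH block rather than the `INSERT DATA INTO` form of Listing 2, which presumes a store that accepts SPARQL 1.1. Prefix declarations for `dbp:` and `dbo:` are omitted and would have to be prepended.

```python
def to_sparql_update(update):
    """Render an update description as SPARQL 1.1 Update operations.

    `update` holds the target graph and subject plus optional "delete"
    and "insert" lists of (property, object) pairs. Deletions are
    emitted before insertions.
    """
    graph = update["target_graph"]
    subject = update["target_subject"]
    statements = []
    for key, keyword in (("delete", "DELETE DATA"), ("insert", "INSERT DATA")):
        pairs = update.get(key, [])
        if not pairs:
            continue
        body = " ".join(f"{subject} {prop} {obj} ." for prop, obj in pairs)
        statements.append(f"{keyword} {{ GRAPH <{graph}> {{ {body} }} }}")
    # Several operations in one request are separated by ';'.
    return " ;\n".join(statements)


patch_15 = {
    "target_graph": "http://dbpedia.org/",
    "target_subject": "dbp:Oregon",
    "insert": [("dbo:language", "dbp:English_language")],
}
query = to_sparql_update(patch_15)
print(query)
```

A data provider would run the generated operations against the local graph only after deciding that the corresponding patches are sufficiently trustworthy.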




Figure 4: Screenshot of the WhoKnows? refinement panel for
the question 'English language is a/the language of ...',
offering options such as 'English language is not a/the
language of Ohio.' and 'English language is also a/the
language of Oregon.'



6. CONCLUSION AND OUTLOOK

In this paper a collaborative approach is demonstrated that
can improve the quality of Linked Data applications by
combining crowdsourcing methods with Linked Data principles.
We have illustrated the feasibility of this approach with a
use case built on a quiz game that collects data about
possible flaws in DBpedia and publishes patches using a
patch request ontology. Finally, a patch repository is
presented that allows a convenient selection of appropriate
updates on Linked Data resources.

As in our use case, some datasets are based on genuine
information sources that are worth correcting directly at
their origin to avoid the recurrence of the same errors in
future releases. We also encourage such efforts for
Wikipedia. Since Wikipedia pages must be edited by humans,
the patch repository's user interface supports web users by
providing direct links to the respective DBpedia and
Wikipedia pages that contain the facts to be improved.

However, our approach is not limited to the given showcase,
but can be applied to any dataset that obeys Linked Data
principles. Moreover, the approach does not rely on
sophisticated crowdsourcing methods, but can also be adopted
by algorithmic data curation systems. Therefore, the
presented application can be of interest to various data
providers that are concerned with data curation. In general,
each aggregator of Linked Data can help to improve the
structural and factual quality of the semantic web. The
following steps are necessary for active participation:

1. Identify potential flaws in original or aggregated data.

2. Implement facilities to gather user feedback.

3. Serialize identified flaws and corresponding updates
using the PatchR ontology.

4. Publish these patches within an appropriate repository
that can be publicly accessed.



With regard to managing distributed information on patches,
we suggest a rather centralized setting, where major dataset
providers rely on dedicated patch repositories to obtain
patches for their particular dataset. Further auxiliary
tasks for the effective synchronization of patches, such as
further standardization and the management of trust, are not
covered in this publication and are subject to future
research.

The repository currently represents merely a proof of con- 
cept and various extensions are conceivable. Further work 
will include the implementation of advanced trust and ac- 
cess control mechanisms as well as validity checking. Fea- 
tures like rating, feedback, and reputation management are 
necessary to provide appropriate incentives in the long run. 
A pingback mechanism might be valuable to inform data 
providers about recently created patch requests concerning 
their datasets. Finally, an API could be implemented to ease
the publication of patches.